[ENH]: parallel implementation of `is_reachable()` with numpy arrays #119
Conversation
Something along similar lines (in terms of enhancing tournament algorithms): we could also consider taking this reusing-the-pool-of-workers approach for `tournament_is_strongly_connected` (in line 80), and that might show some more improvement in speedups. But all our functions are currently wrapped in a [...]. PS: this doesn't have to be handled in this PR; it's just something we can think about in the long run.
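As a rough illustration of the pool-reuse idea (the `check_chunk` helper and data below are hypothetical stand-ins, not the PR's code):

```python
from joblib import Parallel, delayed


def check_chunk(chunk):
    # hypothetical per-chunk work, standing in for something like
    # the per-chunk test in a tournament algorithm
    return any(chunk)


chunks = [[False, False], [True, False]]

# Using Parallel as a context manager keeps one pool of workers alive,
# so consecutive calls avoid repeated worker start-up costs.
with Parallel(n_jobs=-1) as parallel:
    first_pass = parallel(delayed(check_chunk)(c) for c in chunks)
    second_pass = parallel(delayed(check_chunk)(c) for c in chunks)
```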
nx_parallel/algorithms/tournament.py (outdated diff):
```diff
 return all(tnc)


-def is_closed(G, nodes):
-    return all(v in G[u] for u in set(G) - nodes for v in nodes)
+def is_closed(adjM, nodes):
```
Try checking the speed with and without this function. That is, inline the contents of the function into line 42 so Python can avoid a function call.

Another possible speed-up could be to put the simple checks first in the compound logical statement in line 42. That should avoid the `is_closed` code being run at all unless `s in S and t not in S`.
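For illustration, the suggested ordering relies on Python's short-circuiting `and`; the function names below are stand-ins for the PR's code, not its actual API:

```python
def is_closed(adjM, S):
    # stand-in for the expensive closure test in the PR
    ...


def closed_set_separates(adjM, S, s, t):
    # `and` evaluates left to right and stops at the first False,
    # so is_closed() never runs unless s in S and t not in S.
    return s in S and t not in S and is_closed(adjM, S)
```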
Do we know where this is taking most of its time? I was just guessing for these suggestions based on the for-loop structure. Using the IPython tool `%prun` you might be able to tell where the code spends the most time -- but I don't know how that works with parallel joblib. Maybe try it with 1 CPU or with the parallel code part removed.
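For a script (outside IPython, where `%prun` isn't available), a minimal standard-library equivalent could look like this:

```python
import cProfile
import pstats

import networkx as nx

G = nx.algorithms.tournament.random_tournament(1600)

# Profile the sequential call and sort by cumulative time to see
# which functions dominate the runtime.
cProfile.run("nx.algorithms.tournament.is_reachable(G, 0, 1599)", "prof.out")
pstats.Stats("prof.out").sort_stats("cumulative").print_stats(10)
```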
> Try checking the speed with and without this function. That is, inline the contents of the function into line 42 so Python can avoid a function call.
I tried adding this logic to line 42, but it didn't result in any noticeable speedups, so I've decided to keep it as a separate function for better readability. Please let me know if you think otherwise.
> Do we know where this is taking most of its time? I was just guessing for these suggestions based on the for-loop structure. Using the IPython tool `%prun` you might be able to tell where the code spends the most time -- but I don't know how that works with parallel joblib. Maybe try it with 1 CPU or with the parallel code part removed.
I removed the parallel execution and ran `%prun` using the following input:

```python
G = nx.algorithms.tournament.random_tournament(1600)
s, t = 0, 1599
```
A summary of the output:

```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      3/1    0.000    0.000    5.383    5.383 {built-in method builtins.exec}
        1    0.000    0.000    5.383    5.383 <string>:1(<module>)
        1    0.000    0.000    5.383    5.383 decorators.py:17(wrapper)
        1    0.000    0.000    5.383    5.383 tournament.py:14(is_reachable)
        1    0.728    0.728    4.324    4.324 tournament.py:30(two_neighborhood_close)
  1279201    0.296    0.000    2.851    0.000 {built-in method builtins.any}
  6404724    0.878    0.000    2.674    0.000 tournament.py:40(<genexpr>)
 10230374    2.418    0.000    2.418    0.000 memmap.py:357(__getitem__)
        1    0.000    0.000    1.056    1.056 decorators.py:783(func)
        1    0.000    0.000    1.056    1.056 <class 'networkx.utils.decorators.argmap'> compilation 21:1(argmap_to_numpy_array_18)
        1    0.000    0.000    1.042    1.042 backends.py:538(_call_if_any_backends_installed)
        1    0.006    0.006    1.042    1.042 backends.py:1460(_call_with_backend)
        1    0.484    0.484    1.035    1.035 convert_matrix.py:881(to_numpy_array)
```
It is evident from this that the bulk of the time is spent computing two-hop neighborhoods and on memory access via `memmap`. I've tried the other alternatives you mentioned, but this version seems to perform the best.
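If the `memmap` access cost turns out to be the bottleneck, one knob worth noting is joblib's `max_nbytes`, which controls whether large array arguments are memmapped at all. This is a generic joblib sketch, not the PR's code:

```python
import numpy as np
from joblib import Parallel, delayed

adjM = np.zeros((1600, 1600), dtype=bool)

# joblib memmaps numpy array arguments larger than max_nbytes (default
# "1M") to share them between workers; max_nbytes=None disables
# memmapping so each worker gets a plain in-memory copy instead.
counts = Parallel(n_jobs=2, max_nbytes=None)(
    delayed(np.count_nonzero)(adjM[i : i + 800]) for i in (0, 800)
)
```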
Hi @Schefflera-Arboricola and @dschult, I'm trying to run asv benchmarks for [...]
How is your environment set up? "conda/mamba" or "venv + pip"?
Apologies for the oversight! I was able to figure it out from the documentation. Thank you @dschult!
I tried out the following benchmarks:

```
========== ============ ========== =========== ============ ============
--                               num_nodes
---------- -------------------------------------------------------------
 backend        50          100        200         400          800
========== ============ ========== =========== ============ ============
 parallel   6.19±0.4ms   14.5±2ms   44.7±1ms    166±1ms      655±2ms
 networkx   18.5±3ms     74.3±3ms   288±0.7ms   1.13±0.01s   4.53±0.03s
========== ============ ========== =========== ============ ============
```

```
========== ============ ========== ============ ========== ============
--                               num_nodes
---------- ------------------------------------------------------------
 backend        50          100        200          400        800
========== ============ ========== ============ ========== ============
 parallel   1.15±0.1ms   5.00±1ms   17.5±0.3ms   67.8±1ms   270±0.7ms
 networkx   18.3±1ms     73.7±1ms   289±3ms      1.12±0s    4.59±0.03s
========== ============ ========== ============ ========== ============
```
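For reference, an asv benchmark producing tables of this shape could look roughly like the sketch below; the class and method names are illustrative, and the `backend=` keyword assumes a NetworkX version with backend dispatching:

```python
import networkx as nx


class TournamentIsReachable:
    params = [["parallel", "networkx"], [50, 100, 200, 400, 800]]
    param_names = ["backend", "num_nodes"]

    def setup(self, backend, num_nodes):
        self.G = nx.algorithms.tournament.random_tournament(num_nodes, seed=42)

    def time_is_reachable(self, backend, num_nodes):
        nx.algorithms.tournament.is_reachable(
            self.G, 0, num_nodes - 1, backend=backend
        )
```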
The pure Python implementation performs better. I haven't committed the pure Python code yet because I wanted to confirm a few things first. Do these benchmarks imply that converting the graph into a numpy matrix is more expensive than passing the huge graph to each core? Because [...]
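One way to probe that hypothesis is to time the conversion step in isolation, separate from the reachability computation itself (a quick sketch using stock NetworkX functions):

```python
import timeit

import networkx as nx

G = nx.algorithms.tournament.random_tournament(800)

# Time only the graph-to-matrix conversion, to see how much of the
# per-call cost it accounts for on its own.
per_call = timeit.timeit(lambda: nx.to_numpy_array(G), number=5) / 5
print(f"to_numpy_array(G): {per_call:.3f}s per call")
```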
Pure Python parallelism after the recent NetworkX commit, for reference. The snippet is reassembled here as a self-contained function: the enclosing `is_reachable` signature (which supplies `s` and `t` to the inner closures) and the imports are our reconstruction, following the usual nx-parallel conventions.

```python
from joblib import Parallel, delayed

import nx_parallel as nxp


def is_reachable(G, s, t, get_chunks="chunks"):
    def two_neighborhood_close(G, chunk):
        for v in chunk:
            v_adj = G._adj[v]
            # S: nodes reachable from v in at most two hops
            S = {
                x
                for x, x_pred in G._pred.items()
                if x == v or x in v_adj or any(z in v_adj for z in x_pred)
            }
            if s in S and t not in S and is_closed(G, S):
                return True
        return False

    def is_closed(G, S):
        # S is closed when every node outside S points to every node in S
        return all(
            u in S or all(v in unbrs for v in S) for u, unbrs in G._adj.items()
        )

    if hasattr(G, "graph_object"):
        G = G.graph_object

    n_jobs = nxp.get_n_jobs()

    if get_chunks == "chunks":
        node_chunks = nxp.chunks(G, n_jobs)
    else:
        node_chunks = get_chunks(G)

    # t is reachable from s iff no chunk finds a closed set separating them
    return not any(
        Parallel()(delayed(two_neighborhood_close)(G, chunk) for chunk in node_chunks)
    )
```
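A possible way to exercise the parallel version through the dispatcher, assuming the standard nx-parallel setup where the backend is registered under the name "parallel":

```python
import networkx as nx

G = nx.algorithms.tournament.random_tournament(400)

# Route the call to the nx-parallel backend explicitly.
reachable = nx.algorithms.tournament.is_reachable(G, 0, 399, backend="parallel")
print(reachable)
```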
Which version of the config is used when [...]? Can you point me to your benchmarking code? Is it already merged in nx-parallel, or is it only on your local machine?
I didn't modify the number of jobs anywhere in the codebase, so by default, [...]

The local changes I made were: [...]
- `is_reachable()` yielding better speedups (ref. comment).
- `G` shared across multiple cores via joblib's memmapping.